
convert.py: BPE fixes? #2938

Merged (2 commits) on Sep 3, 2023

Conversation

KerfuffleV2
Collaborator

Trying to get https://huggingface.co/BAAI/Aquila-7B working but it... doesn't feel very close.

  • The main scores were added as int but added_tokens adds them as float: this fails in gguf since arrays have to be homogeneous. Adding them with the wrong type might have caused issues too. Fixed that.
  • Some models like Aquila have added tokens but no separate added_tokens.json. I added some logic to fall back to looking for the added tokens in tokenizer.json if it's available (sketched after this list).
  • At least in tokenizer.json (and the Aquila model I mentioned) the added tokens aren't necessarily distinct from the main vocab, so a token has to be filtered out if it's already in the vocab. Maybe added_tokens.json doesn't need this logic?
  • All tokens were added as USER_DEFINED. However, the token-id-to-text rendering in llama.cpp has no case for USER_DEFINED, so it would return an empty string 100% of the time. Fixed to add tokens with NORMAL type.
  • Scores were getting added as the negative token id. Scores aren't even used for BPE, right? I changed it to just use 0.0 like the Falcon converter. I don't think it makes a difference what they're set to.
  • There's some logic to try to handle models that seem to use <0xblah>-style tokens for bytes, same as LLaMA - for example https://huggingface.co/kfkas/Llama-2-ko-7b-Chat
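
Roughly, the added-token handling above looks like this (a minimal sketch, not the exact convert.py code; the function name and arguments are stand-ins for whatever the converter already has on hand):

```python
import json
from pathlib import Path

def load_added_tokens(model_dir: Path, vocab: dict) -> dict:
    """Prefer added_tokens.json, fall back to tokenizer.json, and drop
    anything that's already in the main vocab."""
    added = {}
    added_path = model_dir / "added_tokens.json"
    tokenizer_path = model_dir / "tokenizer.json"
    if added_path.exists():
        added = json.loads(added_path.read_text(encoding="utf-8"))
    elif tokenizer_path.exists():
        # HF tokenizer.json keeps added tokens in a top-level "added_tokens" list
        tokenizer = json.loads(tokenizer_path.read_text(encoding="utf-8"))
        added = {item["content"]: item["id"] for item in tokenizer.get("added_tokens", [])}
    # Added tokens aren't guaranteed to be distinct from the main vocab
    # (Aquila duplicates several), so filter those out.
    return {text: tok_id for text, tok_id in added.items() if text not in vocab}
```

Each surviving token then gets written with a float score (0.0) and NORMAL token type, so the gguf arrays stay homogeneous and llama.cpp can actually render the text.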

You won't get anywhere without #2889

That's apparently not enough for the Aquila model to do anything except die. Apparently it has no newline token? We just expect it to exist and blindly look it up:

llama.cpp/llama.cpp, lines 1752 to 1757 at e8422de:

// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
    vocab.linefeed_id = llama_byte_to_token(vocab, '\n');
} else {
    vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0];
}
- so it'll crash there unless you comment out that line.

Another issue with trying to get the Aquila model to do anything is that the JSON files are not UTF-8; they're apparently GBK. On Linux at least you can convert them with something like iconv -f gbk -t utf8 < vocab.json > vocab.json.tmp and then rename over the original file. tokenizer.json also needs this treatment.
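
If iconv isn't available, the same re-encoding can be done with a few lines of Python (assuming the files really are GBK):

```python
# Re-encode the GBK JSON files to UTF-8 in place, keeping a .bak copy of the original bytes.
from pathlib import Path

for name in ("vocab.json", "tokenizer.json"):
    path = Path(name)
    raw = path.read_bytes()
    Path(name + ".bak").write_bytes(raw)
    path.write_text(raw.decode("gbk"), encoding="utf-8")
```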

After all that, it sort of works, for English text at least. Except it doesn't have ' ' (space) in its vocabulary, so tokenizing and all output is goingtobeallruntogethergoodtimes. It can't seem to tokenize Chinese characters at all. I also tried skipping the GBK-to-UTF-8 conversion and pasting in the wonky GBK stuff from the original files, but that didn't seem to work either (though I don't know that it would paste correctly, since my system is set to use UTF-8).

As for the other Korean model, one problem is it has no vocab.json, which convert.py's BPE mode requires, but that's pretty easy to extract from tokenizer.json (sketched after the logs below). It actually tokenizes properly (I think?) but only produces garbage output whether you speak Korean or English to it:

Trying Korean
main: prompt: '옛날 옛적에 작은 여우가 있었어요'
main: number of tokens in prompt = 14
 45555 -> '옛'
 44939 -> '날'
    35 -> ' '
 45555 -> '옛'
 44824 -> '적'
 31054 -> '에'
    35 -> ' '
 43089 -> '작은'
    35 -> ' '
 31457 -> '여'
 37719 -> '우가'
    35 -> ' '
 44815 -> '있'
 38248 -> '었어요'
main: static prompt based on n_keep: '옛날 옛적에 작은 여우가 있었어요'

sampling: repeat_last_n = 0, repeat_penalty = 1.040000, presence_penalty = 0.000000, frequency_penalty = 0.060000, top_k = 70, tfs_z = 0.950000, top_p = 2.000000, typical_p = 0.250000, temp = 1.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 200, n_batch = 4, n_predict = 500, n_keep = 14


옛날 옛적에 작은 여우가 있었어요는,는의,를가의에ing에가:가>가의에에를가>의가>를의를가는에,의가=가,의,가,가=가,o>t의는,>>r는t>r>o=r,o.w.t.o.o.r.
Trying English
main: prompt: ' Once upon a time, in a dark forest, there lived a little fox'
main: number of tokens in prompt = 32
    35 -> ' '
 26222 -> 'Once'
    35 -> ' '
   786 -> 'up'
   265 -> 'on'
    35 -> ' '
 29874 -> 'a'
    35 -> ' '
  2230 -> 'time'
 29892 -> ','
    35 -> ' '
   262 -> 'in'
    35 -> ' '
 29874 -> 'a'
    35 -> ' '
 26031 -> 'dark'
    35 -> ' '
  1454 -> 'for'
   342 -> 'est'
 29892 -> ','
    35 -> ' '
 12711 -> 'there'
    35 -> ' '
 29880 -> 'l'
  2347 -> 'ived'
    35 -> ' '
 29874 -> 'a'
    35 -> ' '
 29880 -> 'l'
  1992 -> 'ittle'
    35 -> ' '
  8944 -> 'fox'
main: static prompt based on n_keep: ' Once upon a time, in a dark forest, there lived a little fox'

sampling: repeat_last_n = 0, repeat_penalty = 1.040000, presence_penalty = 0.000000, frequency_penalty = 0.060000, top_k = 70, tfs_z = 0.950000, top_p = 2.000000, typical_p = 0.250000, temp = 1.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 200, n_batch = 4, n_predict = 500, n_keep = 32


 Once upon a time, in a dark forest, there lived a little fox은2의은의,는이,은에(,는(32의2(이은3에(022에은(의2은,에32,의은2(은이은2(이은의,32(에3의0(에3은(|,은,에(_에_,322,,은,23(023(은은,,은0에3은
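
(For reference, the vocab.json extraction mentioned above is only a couple of lines, assuming the usual HF tokenizers layout where the BPE token-to-id mapping lives under model.vocab:)

```python
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tokenizer = json.load(f)

# Write just the BPE token -> id table so convert.py's BPE mode can find it.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer["model"]["vocab"], f, ensure_ascii=False)
```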

This could be merged but hopefully we can talk about the existing problems and maybe figure out a way to fix more stuff first.

@KerfuffleV2
Collaborator Author

@klosax since you're the one that suggested testing with that Aquila model: #2842 (comment)

@ggerganov (Owner) left a comment

These changes should be OK to merge.

We cannot expect the current built-in BPE tokenizer to work with unicode chars. I removed this support when porting the reference implementation from https://github.com/cmp-nct/ggllm.cpp

It's just too much code for my taste.
If we cannot figure out an elegant way to support the BPE tokenizer (i.e. without 1000s of lines of unicode tables and handling), we probably have to drop support for it and require the user code to implement the tokenization. We can potentially expose the vocab through the API if necessary.

@KerfuffleV2
Collaborator Author

Thanks for the review.

Let me ask this: is there a model that convert can handle in BPE mode which is expected to work for running in main, etc.? Just from the BPE models other people have mentioned and that I've seen, it seems like the LLaMA models adapted to use BPE use it specifically because it deals with unicode characters better than the SPM one, which has pretty inefficient tokenization for stuff like Chinese characters.

In other words, is there currently a practical use case for --vocabtype bpe in convert.py? Because if not, that could be confusing. Right now there's no indication to the user that setting the vocab type to bpe is less supported. Perhaps we can add some warnings to the script and change the type to experimental-bpe. Something like that, so users don't just download a BPE LLaMA model and expect it to work.
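
Something along these lines, purely as a sketch of the idea (it assumes convert.py's existing --vocabtype argparse flag; the exact names are illustrative):

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# Hypothetical rename: make it obvious the BPE path isn't fully supported yet.
parser.add_argument("--vocabtype", choices=["spm", "experimental-bpe"], default="spm",
                    help="vocab format (default: spm)")
args = parser.parse_args()

if args.vocabtype == "experimental-bpe":
    print("warning: BPE vocab support is experimental; tokenization in llama.cpp "
          "may not work correctly for this model", file=sys.stderr)
```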

I don't know too much about which LLaMAs use BPE, which should work, etc. With a little more information I'd be happy to add something like what I mentioned to convert.py. What do you think?

@ggerganov
Owner

In other words, is there currently a practical use case for --vocabtype bpe in convert.py?

I don't think so. We support LLaMA models that use the original SPM tokenizer. All other cases are experimental.
BPE is partially implemented and I have tested it only with Falcon-7B and Falcon-40B.

so users don't just download a BPE LLaMA model and expect it to work.

Inference works, it's just the tokenization that does not work.

@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Sep 2, 2023

BPE is partially implemented and I have tested it only with Falcon-7B and Falcon-40B.

Falcon models aren't converted using convert.py though. For convert.py specifically, do you think we should do anything different for the BPE mode? Like I mentioned, changing the type to experimental-bpe, adding a warning, etc?

Maybe there should be something like a none vocab type that just doesn't include vocab with the model? (I think that's more complex so would probably be a different pull.)

edit: If you don't want to make further changes like that, please feel free to merge this. I think it's better than the existing behavior at least.

@ggerganov
Owner

Ok, let's merge for now. I'm not very sure what to do about Falcon - will figure it out as we go

@ggerganov ggerganov merged commit cff7b0b into ggerganov:master Sep 3, 2023
4 checks passed
@goerch
Collaborator

goerch commented Sep 18, 2023

@KerfuffleV2 : please check if #3252 improves the situation. And yes, I'm still converting Falcon-7B with convert-falcon-hf-to-gguf.py. Is there a better way?

@KerfuffleV2 KerfuffleV2 deleted the fix-convert-bpe branch November 17, 2023 03:13